Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits
Ishii, Masato, Hayakawa, Akio, Shibuya, Takashi, Mitsufuji, Yuki
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between the edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
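The abstract describes dynamically weighting the source audio by the complexity of the video edit. A minimal NumPy sketch of that idea follows; the function names, the frame-distance complexity proxy, and the linear latent blend are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def edit_complexity(src_frames, tgt_frames):
    # Proxy for how much the edit changed the video: mean per-frame
    # L2 distance between source and target, squashed to [0, 1].
    d = np.linalg.norm(src_frames - tgt_frames, axis=-1).mean()
    return float(np.tanh(d))

def condition_on_source_audio(src_audio_latent, generated_latent, complexity):
    # Small edits keep more of the source audio's structure; large
    # edits lean on the freshly generated audio instead.
    w = 1.0 - complexity
    return w * src_audio_latent + (1.0 - w) * generated_latent
```

In this sketch an unchanged video yields complexity 0, so the source audio passes through untouched, matching the stated goal of "preserving the original audio structure where possible".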
ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation
Sun, Jiahui, Wang, Weining, Sun, Mingzhen, Yang, Yirong, Zhu, Xinxin, Liu, Jing
Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
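Two mechanics in this abstract lend themselves to a small sketch: reshaping raw audio into a video-like representation so both modalities share temporal/spatial axes, and stacking per-modality 2D latents into one 3D latent for the diffusion transformer. The sketch below is a hedged NumPy illustration under assumed shapes; the function names are not from the paper.

```python
import numpy as np

def audio_to_video_like(mel, n_frames):
    # Reshape a (time, mel_bins) spectrogram into n_frames "frames",
    # aligning the audio's temporal axis with the video's.
    t, bins = mel.shape
    assert t % n_frames == 0, "time steps must divide evenly into frames"
    return mel.reshape(n_frames, t // n_frames, bins)

def stack_latents(video_lat, audio_lat):
    # Concatenate per-frame 2D latents from both streams along the
    # token axis into a unified (frames, tokens, channels) 3D latent
    # for a spatio-temporal diffusion transformer to process.
    return np.concatenate([video_lat, audio_lat], axis=1)
```

The point of the unified 3D latent is that one transformer can attend over both modalities' tokens per frame instead of running two independent backbones.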
Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
Marinoni, Christian, Gramaccioni, Riccardo Fosco, Grassucci, Eleonora, Comminiello, Danilo
The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation, addressing this unexplored gap. Specifically, we propose a diffusion model conditioned on a set of powerful signals derived from the full 360-degree space: a panoramic saliency map to identify regions of interest, a bounding-box-aware signed distance map to define the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially aware viewpoint video and audio that are coherently influenced by the broader, unseen environmental context, introducing a strong controllability that is essential for realistic and immersive audio-visual generation. We show audio-visual examples demonstrating the effectiveness of our framework.
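Of the three conditioning signals named above, the bounding-box-aware signed distance map is the most concrete to illustrate. Below is a minimal NumPy sketch of one standard construction (negative inside the target-viewpoint box, positive outside, measured in pixels); the paper may define it differently, so treat this as an assumption.

```python
import numpy as np

def bbox_signed_distance_map(h, w, x0, y0, x1, y1):
    # Signed distance from every pixel to an axis-aligned bounding
    # box: positive outside the box, negative inside.
    ys, xs = np.mgrid[0:h, 0:w]
    # Distance components for pixels outside the box (clamped at 0).
    dx = np.maximum(np.maximum(x0 - xs, xs - x1), 0)
    dy = np.maximum(np.maximum(y0 - ys, ys - y1), 0)
    outside = np.sqrt(dx ** 2 + dy ** 2)
    # Inside the box, distance to the nearest edge (non-negative).
    inside = np.minimum(np.minimum(xs - x0, x1 - xs),
                        np.minimum(ys - y0, y1 - ys))
    return np.where(outside > 0, outside, -inside)
```

A map like this tells the diffusion model, per pixel, how far the content is from the target viewpoint, which is what lets off-camera events influence the generated view.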
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Lin, Yan-Bo, Lin, Kevin, Yang, Zhengyuan, Li, Linjie, Wang, Jianfeng, Lin, Chung-Ching, Wang, Xiaofei, Bertasius, Gedas, Wang, Lijuan
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html
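"Delta denoising" in zero-shot editing typically means applying only the difference between the target-prompt and source-prompt noise predictions, so content untouched by the edit is preserved. A minimal NumPy sketch of one such step follows; this is a generic illustration of the delta idea, not AvED's exact cross-modal formulation.

```python
import numpy as np

def delta_denoise_step(latent, eps_src, eps_tgt, guidance=1.0):
    # Zero-shot delta edit: move the latent only along the *difference*
    # between the noise predicted under the target prompt and under
    # the source prompt. Where the two predictions agree, the latent
    # (and hence the unedited content) is left unchanged.
    return latent - guidance * (eps_tgt - eps_src)
```

In the cross-modal setting described above, the audio and video streams would each take such steps while exchanging information, so the two deltas stay synchronized.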
UniForm: A Unified Diffusion Transformer for Audio-Video Generation
Zhao, Lei, Feng, Linfeng, Ge, Dongxu, Yi, Fangqiu, Zhang, Chi, Zhang, Xiao-Lei, Li, Xuelong
As naturally multimodal content, audible video delivers an immersive sensory experience; consequently, audio-video generation systems have substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules for generating each modality and lack exploration of shared-weight generative modules. This approach may underuse the intrinsic correlations between the audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at https://uniform-t2av.github.io/.
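The key contrast the abstract draws is shared-weight generation over a concatenated latent versus independent per-modality modules. A minimal NumPy sketch of the concatenation step is below; the modality-id channel is a common way to let a shared transformer tell the two token streams apart, but it is an assumption here, not a detail taken from the paper.

```python
import numpy as np

def unify_tokens(video_tokens, audio_tokens):
    # Concatenate (Tv, d) video tokens and (Ta, d) audio tokens along
    # the sequence axis so one shared-weight transformer attends over
    # both modalities jointly. A modality-id vector (0 = video,
    # 1 = audio) lets the model distinguish the streams.
    mod_id = np.concatenate([np.zeros(len(video_tokens)),
                             np.ones(len(audio_tokens))])
    return np.concatenate([video_tokens, audio_tokens], axis=0), mod_id
```

With this layout, audio-guided video generation and video-guided audio generation reduce to fixing one segment of the unified sequence and denoising the other.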
Review for NeurIPS paper: Labelling unlabelled videos from scratch with multi-modal self-supervision
Weaknesses: Required clarifications: there are some parts of the work that would require clarification; see below:
* The description of the exact algorithm is not completely clear to me in the paper (and the appendix). I understand that code is provided, but it should be clarified in the paper. In particular:
  - Is it a pure alternating approach?
  - How many examples are sampled for the clustering stage? Is N equal to the number of examples in the dataset?
  - If I understand correctly, thanks to the probabilistic formulation, once the data is re-clustered there is no need to re-initialize the last linear layer; is that correct? If not, it is unclear to me how to apply the algorithm in an online fashion (see later for a related question).
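For readers unfamiliar with the "pure alternating approach" the reviewer is asking about, the generic pattern alternates between assigning pseudo-labels by clustering and re-fitting the model on those labels. The NumPy sketch below shows one k-means-style alternation purely as background; it is not the paper's algorithm (which uses a probabilistic formulation and would retrain the network's last linear layer rather than re-estimate centroids).

```python
import numpy as np

def alternate_step(features, centroids):
    # One alternation of a generic self-labelling loop:
    # (1) assign each feature a pseudo-label via its nearest centroid,
    # (2) re-estimate centroids from those assignments.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    new_centroids = np.stack([
        features[labels == k].mean(axis=0) if (labels == k).any()
        else centroids[k]                 # keep empty clusters in place
        for k in range(len(centroids))
    ])
    return labels, new_centroids
```

The reviewer's question is whether the paper's probabilistic formulation lets step (2) reuse the existing classifier head after re-clustering instead of re-initializing it, which would also be what makes an online variant feasible.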